The purpose of our analysis involves exploring potential methods of increasing ridership for the DC Captial Bikeshare program. This motivation in focusing on ridership aligns with the DC Bikeshare’s main mission in transforming the Washingtion Metropolitan Area and residing communities by providing an affordable bicycle transit system. The goals outlined in this targeted reformation includes decreasing traffic congestion, promoting health and wellness, decreasing air pollution, and expanding transportation options. The DC Biekshare program is owned by several jurisdictions however, is mainly managed by the DC local government.
More information pertaining to the bikeshare program can be found by following this link: https://capitalbikeshare.com/
Following the methods outline in the CRISP-DM Methodology, we initially focused on understanding the various business aspects to our problem before analyzing any data. We determine our business objectives by researching the background of the DC Bikeshare program, as seen above. Next, we determine our data mining goals and identified a overall question we aim to solve for our analysis.
This question pertains to the following: How do we make it safer for people to ride bikes as a means of increasing ridership?
Baseline packages for our analysis:
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.2
## Warning: package 'lubridate' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
## Warning: package 'janitor' was built under R version 4.3.2
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(lubridate) # because we will probably see some dates
library(here) # more easily access files in your project
## Warning: package 'here' was built under R version 4.3.2
## here() starts at U:/DS241/Final-Bikeshare-Project_DS241_Team-4
library(mapview)
## Warning: package 'mapview' was built under R version 4.3.2
library(gbfs)
## Warning: package 'gbfs' was built under R version 4.3.2
library(sf) # working with simple features - geospatial
## Warning: package 'sf' was built under R version 4.3.2
## Linking to GEOS 3.11.2, GDAL 3.7.2, PROJ 9.3.0; sf_use_s2() is TRUE
library(tmap)
## Warning: package 'tmap' was built under R version 4.3.2
## Breaking News: tmap 3.x is retiring. Please test v4, e.g. with
## remotes::install_github('r-tmap/tmap')
library(tidycensus)
## Warning: package 'tidycensus' was built under R version 4.3.2
library(stringr)
library(raster)
## Warning: package 'raster' was built under R version 4.3.2
## Loading required package: sp
## Warning: package 'sp' was built under R version 4.3.2
##
## Attaching package: 'raster'
##
## The following object is masked from 'package:dplyr':
##
## select
library(readr)
library(maps)
## Warning: package 'maps' was built under R version 4.3.2
##
## Attaching package: 'maps'
##
## The following object is masked from 'package:purrr':
##
## map
library(spData)
## Warning: package 'spData' was built under R version 4.3.2
## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`
library(viridis)
## Warning: package 'viridis' was built under R version 4.3.2
## Loading required package: viridisLite
##
## Attaching package: 'viridis'
##
## The following object is masked from 'package:maps':
##
## unemp
library(ggridges)
## Warning: package 'ggridges' was built under R version 4.3.2
library(ggbeeswarm)
## Warning: package 'ggbeeswarm' was built under R version 4.3.2
We are initially looking at the following three data sets:
bikes_september <- read_csv(here("data_raw","202309-capitalbikeshare-tripdata.csv")) %>% clean_names()
## Rows: 450090 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
bikes_lanes = st_read(here("data_raw", "Bicycle_Lanes.geojson")) %>% clean_names()
## Reading layer `Bicycle_Lanes' from data source
## `U:\DS241\Final-Bikeshare-Project_DS241_Team-4\data_raw\Bicycle_Lanes.geojson'
## using driver `GeoJSON'
## Simple feature collection with 2275 features and 26 fields
## Geometry type: LINESTRING
## Dimension: XYZ
## Bounding box: xmin: -77.08773 ymin: 38.82359 xmax: -76.93066 ymax: 38.98276
## z_range: zmin: 0 zmax: 0
## Geodetic CRS: WGS 84
csv_bikes_lanes = read_csv(here("data_raw", "Bicycle_Lanes.csv")) %>% clean_names()
## Rows: 2275 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (18): ROUTENAME, SUBBLOCKKEY, BIKELANE_PARKINGLANE_ADJACENT, BIKELANE_TH...
## dbl (8): ROUTEID, ROADTYPE, QUADRANT, TOTALBIKELANES, TOTALBIKELANEWIDTH, W...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
csv_bikes_crashes = read_csv(here("data_raw", "Crashes_in_DC.csv")) %>% clean_names()
## Rows: 67621 Columns: 58
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (12): REPORTDATE, ROUTEID, FROMDATE, ADDRESS, WARD, EVENTID, MAR_ADDRESS...
## dbl (45): X, Y, OBJECTID, CRIMEID, CCN, MEASURE, OFFSET, STREETSEGID, ROADWA...
## lgl (1): TODATE
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dc_shape = st_read(here("data_raw","DC_Health_Planning_Neighborhoods.geojson")) %>% clean_names()
## Reading layer `DC_Health_Planning_Neighborhoods' from data source
## `U:\DS241\Final-Bikeshare-Project_DS241_Team-4\data_raw\DC_Health_Planning_Neighborhoods.geojson'
## using driver `GeoJSON'
## Simple feature collection with 51 features and 8 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -77.11976 ymin: 38.79165 xmax: -76.9094 ymax: 38.99556
## Geodetic CRS: WGS 84
class(bikes_lanes)
## [1] "sf" "data.frame"
class(dc_shape)
## [1] "sf" "data.frame"
class(csv_bikes_crashes)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
class(csv_bikes_lanes)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
mapview(dc_shape, lwd=3, alpha=0.2) +
mapview(bikes_lanes, zcol = "streetname", lwd = 3, layer.name = "bike_lanes", legend = FALSE)
Filtering the crash data. First, we looked at variables that we are interested in analyzing from the total data set. We then filtered the data to only look at accidents relating the bicycles. This was determined by setting a requirement that there must be at least 1 bicycle involved in the accident. Columns not relevant to our analysis was removed from the data set.
df1_bike_crashes = csv_bikes_crashes %>%
filter(total_bicycles >= "1") %>%
dplyr::select(-c(locationerror,objectid, crimeid, ccn, reportdate,routeid, streetsegid, roadwaysegid, todate,eventid, majorinjuries_driver, majorinjuries_pedestrian, majorinjuriespassenger,minorinjuries_driver, minorinjuries_pedestrian, minorinjuriespassenger,fatal_driver, fatal_pedestrian, fatalpassenger,unknowninjuries_driver,unknowninjuries_bicyclist,unknowninjuries_pedestrian,unknowninjuriespassenger, mar_address,mar_score,mar_id,total_vehicles,total_pedestrians,total_taxis,total_government,pedestriansimpaired,driversimpaired, speeding_involved, lastupdatedate))
sf_bike_crashes <- df1_bike_crashes %>%
st_as_sf(coords = c("longitude","latitude"), crs = 4326) %>%
st_cast("POINT")
class(sf_bike_crashes)
## [1] "sf" "tbl_df" "tbl" "data.frame"
mapview(dc_shape, lwd=3) +
mapview(bikes_lanes, zcol = "streetname", lwd = 3, layer.name = "bike_lanes", legend = FALSE) +
mapview(sf_bike_crashes, color = "orange", lwd = 0.5, layer.name = "bike_crashes", legend = FALSE)
We will pivot longer the crash data to reduce the number of variables and increase the number of observations. The data is originally separated column-wise based on the type of injuries however, we aim to create a single column that lists the injuries as observations instead. Also, since each type of injury was listed for every street name, any values equivalent to 0, was dropped.
df1_pivot_bike_crashes = df1_bike_crashes %>%
pivot_longer(cols=c("majorinjuries_bicyclist","minorinjuries_bicyclist","fatal_bicyclist"),
names_to = "injuries",
values_to = "number_of_injuries") %>%
dplyr::select(-c(bicyclistsimpaired)) %>%
relocate(c(injuries, number_of_injuries), .after = ward) %>%
mutate(injuries = recode(injuries, majorinjuries_bicyclist = "major",
minorinjuries_bicyclist = "minor",
fatal_bicyclist = "fatal" )) %>%
filter(number_of_injuries >= "1")
Next, we filter the bike crash data to only show bicycle related crashes in September 2023. We chose this time period specifically because the ridership data is only applicable to this time range as well.
Sept_df1_pivot_bike_crashes = df1_pivot_bike_crashes %>%
filter(grepl("2023/09", fromdate))
We will move on to filtering the bike lane data. Similarly for the crash data, we will also pivot longer this set to create a single column that lists out the different type of bike lanes into one column. We wanted columns that separated the “adjacent_to_bikelane” and “protection_of_lane” observations thus, we had to pivot the table twice.
We will not look at the contraflow lane as it does not fit into the data we would like to look at, which includes the location of the bike lane adjacent to motorized vehicles lanes and the design of the lane that provide protection. Contraflow only deals with the direction the lane is flowing in relations to traffic. Also, there is significant NA in the observations, which we removed.
df1_pivot_bike_lanes = bikes_lanes %>%
pivot_longer(col=c("bikelane_parkinglane_adjacent","bikelane_throughlane_adjacent","bikelane_pocketlane_adjacent"),
names_to = "adjacent_to_bikelane",
values_to = "direction_of_lane_adjacent") %>%
pivot_longer(col=c("bikelane_conventional","bikelane_dual_buffered","bikelane_protected", "bikelane_buffered"),
names_to = "protection_of_bikelane",
values_to = "direction_of_lane_protection") %>%
dplyr::select(-bikelane_contraflow) %>%
relocate(c(adjacent_to_bikelane,direction_of_lane_adjacent, protection_of_bikelane, direction_of_lane_protection), .after = streettype) %>%
mutate(adjacent_to_bikelane = recode(adjacent_to_bikelane, bikelane_throughlane_adjacent = "throughlane", bikelane_pocketlane_adjacent = "pocketlane", bikelane_parkinglane_adjacent = "parkinglane")) %>%
mutate(protection_of_bikelane = recode(protection_of_bikelane, bikelane_protected = "protected", bikelane_buffered = "buffered", bikelane_conventional = "conventional", bikelane_dual_protected = "dual_protected", bikelane_dual_buffered = "dual_buffered")) %>%
drop_na(c(direction_of_lane_adjacent, direction_of_lane_protection))
Since we filter the data sets as df objects, we have to make sure both sets are set to sf objects before plotting again.
class(df1_pivot_bike_lanes)
## [1] "sf" "tbl_df" "tbl" "data.frame"
class(Sept_df1_pivot_bike_crashes)
## [1] "tbl_df" "tbl" "data.frame"
The crash data needs to be set to a sf object.
sf_Sept_df1_pivot_bike_crashes <- Sept_df1_pivot_bike_crashes %>%
st_as_sf(coords = c("longitude","latitude"), crs = 4326) %>%
st_cast("POINT")
sf_df1_pivot_bike_crashes <- df1_pivot_bike_crashes%>%
st_as_sf(coords = c("longitude","latitude"), crs = 4326) %>%
st_cast("POINT")
class(sf_Sept_df1_pivot_bike_crashes)
## [1] "sf" "tbl_df" "tbl" "data.frame"
Now, that both sets are in the correct classification, we will attempt to plot these sets.
First, we will look at the different classification of the bike lanes, mainly the protection_of_bikelane variable.
Defining a palette for the maps
pal = mapviewPalette("mapviewSpectralColors")
mapview(dc_shape, lwd=2, alpha = 0.5, alpha.regions = 0.2) +
mapview(df1_pivot_bike_lanes, zcol = "protection_of_bikelane", burst = TRUE, lwd = 3, alpha = 0.8, layer.name = "bike_lanes", legend = TRUE)
mapview(dc_shape, lwd=2, alpha = 0.5, alpha.regions = 0.2) +
mapview(df1_pivot_bike_lanes, lwd = 2, zcol ="protection_of_bikelane",burst= TRUE, alpha = 0.8, layer.name = "bike_lanes", legend = TRUE) +
mapview(sf_Sept_df1_pivot_bike_crashes, zcol = "injuries", burst= TRUE, col.region = pal, lwd = 2, layer.name = "bike_crashes", legend = TRUE)
After reducing the crash data, we were able to get our code to work, thus we will plot another map including more observations from the crash data.
mapview(dc_shape, lwd=2, alpha = 0.5, alpha.regions = 0.2) +
mapview(df1_pivot_bike_lanes, lwd = 2, zcol ="protection_of_bikelane",burst= TRUE, alpha = 0.8, layer.name = "bike_lanes", legend = TRUE) +
mapview(sf_df1_pivot_bike_crashes, zcol = "injuries", burst= TRUE, col.region = pal, lwd = 2, layer.name = "bike_crashes", legend = TRUE)